Voice barge-in in telephony speech recognition

ABSTRACT

An interactive voice response system is described that supports full-duplex data transfer to enable the playing of a voice prompt to a user of a telephony system while the system listens for voice barge-in from the user. The system includes a speech detection module that may utilize various criteria, such as frame energy magnitude and duration thresholds, to detect speech. The system also includes an automatic speech recognition engine. When the automatic speech recognition engine recognizes a segment of speech, a feature extraction module may be used to subtract a prompt echo spectrum, which corresponds to the currently playing voice prompt, from an echo-dirtied speech spectrum recorded by the system. In order to improve spectrum subtraction, an estimation of the time delay between the echo-dirtied speech and the prompt echo may also be performed.

FIELD OF THE INVENTION

[0001] The present invention relates to the field of speech recognition and, in particular, to voice barge-in for speech recognition based telephony applications.

BACKGROUND OF THE INVENTION

[0002] Speech recognition based telephony systems are used by businesses to answer phone calls with a system that engages users in natural language dialog. These systems use interactive voice response (IVR) telephony applications for a spoken language interface with a telephony system. IVR applications enable users to interrupt the system output at any time, for example, if the output is based on an erroneous understanding of a user's input or if it contains superfluous information that a user does not want to hear. Barge-in allows a user to interrupt a prompt being played using voice input. Enabling barge-in may significantly enhance the user's experience by allowing the user to interrupt the system prompt, whenever desired, in order to save time. Without barge-in, a user may react only when the system prompt completes; otherwise the user's input is ignored by the system. This may be very inconvenient to the user, particularly when the prompt is long and the user already knows the prompt message.

[0003] In today's touch tone based IVR systems, barge-in is widely adopted. However, for speech recognition based IVR systems, barge-in poses a much greater challenge due to background noise and echo from a prompt that may be transmitted to a voice recognition system.

[0004] One method of barge-in, referred to as key barge-in, is to stop playing a prompt and be ready to process a user's speech after the user presses a special key, such as the “#” or “*” key. One problem with such a method is that the user must be informed of how to use it. As such, another prompt may need to be added to the system, thereby undesirably increasing the amount of user interaction time with the system.

[0005] Another method of barge-in, referred to as voice barge-in, enables a user to speak directly to the system to interrupt the prompt. FIG. 1 illustrates how barge-in occurs during prompt play in a voice barge-in system. Such a method uses speech detection to detect a user's speech while the prompt is playing. Once the user's speech is detected in the incoming data, the system stops playing and immediately begins a record phase in which the incoming data is made available to a speech recognition engine. The speech recognition engine processes the user's speech.

[0006] Although such a method may provide a better solution than key barge-in, the voice barge-in function of current IVR systems has several problems. One problem with current IVR systems is that the computer-telephone cards used in these systems may not support full-duplex data transfer. Another problem with current IVR systems is that they may not be able to detect speech robustly from background noise, non-speech sounds, irrelevant speech and/or prompt echo. For example, the prompt echo that resides in these systems may significantly degrade speech quality. Using traditional adaptive filtering methods to remove near-end prompt echo may significantly degrade the performance of automatic speech recognition engines used in these systems.

BRIEF DESCRIPTION OF THE DRAWINGS

[0007] The present invention is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar elements and in which:

[0008] FIG. 1 illustrates barge-in during prompt play in a voice barge-in system.

[0009] FIG. 2 illustrates one embodiment of an interactive voice response telephony system.

[0010] FIG. 3 illustrates one embodiment of a method of implementing an interactive voice response system.

[0011] FIG. 4 illustrates one embodiment of speech detection in an input signal.

[0012] FIG. 5 illustrates one embodiment of a feature extraction method.

[0013] FIG. 6 illustrates an embodiment of a feature extraction method for a particular feature.

DETAILED DESCRIPTION

[0014] In the following description, numerous specific details are set forth such as examples of specific systems, components, modules, etc. in order to provide a thorough understanding of the present invention. It will be apparent, however, to one skilled in the art that these specific details need not be employed to practice the present invention. In other instances, well known components or methods have not been described in detail in order to avoid unnecessarily obscuring the present invention.

[0015] The present invention includes various steps, which will be described below. The steps of the present invention may be performed by hardware components or may be embodied in machine-executable instructions, which may be used to cause a general-purpose or special-purpose processor programmed with the instructions to perform the steps. Alternatively, the steps may be performed by a combination of hardware and software.

[0016] The present invention may be provided as a computer program product, or software, that may include a machine-readable medium having stored thereon instructions, which may be used to program a computer system (or other electronic devices) to perform a process according to the present invention. A machine-readable medium includes any mechanism for storing or transmitting information in a form (e.g., software) readable by a machine (e.g., a computer). The machine-readable medium may include, but is not limited to, magnetic storage medium (e.g., floppy diskette); optical storage medium (e.g., CD-ROM); magneto-optical storage medium; read only memory (ROM); random access memory (RAM); erasable programmable memory (e.g., EPROM and EEPROM); flash memory; electrical, optical, acoustical or other form of propagated signal (e.g., carrier waves, infrared signals, digital signals, etc.); or other type of medium suitable for storing electronic instructions.

[0017] FIG. 2 illustrates one embodiment of an interactive voice response telephony system. IVR system 200 allows for a spoken language interface with telephony system 290. IVR system 200 supports voice barge-in by enabling a user to interrupt a prompt being played using voice input. In one embodiment, IVR system 200 includes interface module 205 and voice processing module 225. Interface module 205 provides interface circuitry for direct connection of voice processing module 225 with line 203 carrying voice data. Line 203 may be an analog or a digital line.

[0018] Interface module 205 includes voice input device 210 and voice output device 220. Voice input device 210 and voice output device 220 may be routed together using bus 215 to support full-duplex data transfer. Voice input device 210 provides for voice data transfer from telephony system 290 to voice processing module 225. Voice output device 220 provides for voice data transfer from voice processing module 225 to telephony system 290. For example, voice output device 220 may be used to play a voice prompt to a user of telephony system 290 while voice input device 210 is used to listen for barge-in (e.g., voice or key) from a user.

[0019] In one embodiment, for example, voice devices 210 and 220 may be Dialogic D41E cards, available from Dialogic Corporation of Parsippany, N.J. Dialogic's SCbus routing function may be used to establish communications between the Dialogic D41E cards. In an alternative embodiment, voice devices from other manufacturers may be used, for example, cards available from Natural Microsystems of Framingham, Mass.

[0020] In one embodiment, voice processing module 225 may be implemented as a software processing module. Voice processing module 225 includes speech detection module 230, feature extraction module 240, automatic speech recognition (ASR) engine 250, and prompt generation module 260. Speech detection module 230 may be used to detect voice initiation in the data signal received from voice input device 210. Feature extraction module 240 may be used to extract features used by ASR engine 250 and to remove prompt echo from input signal 204. A feature is a representation of a speech signal that is suitable for automatic speech recognition. For example, a feature may be Mel-Frequency Cepstrum Coefficients (MFCC) and their first and second order derivatives, as discussed below in relation to FIG. 6. As such, feature extraction may be used to obtain a speech feature from the original speech signal waveform.

[0021] ASR engine 250 provides the function of speech recognition. Input 231 to ASR engine 250 contains vectors of speech. Output 241 of ASR engine 250 is a recognition result in the form of a word string. When ASR engine 250 recognizes a segment of speech, feature extraction module 240 cleans up the speech-containing data signal according to the particular prompt that is playing. For example, feature extraction module 240 may subtract the corresponding prompt echo's spectrum from the echo-dirtied speech spectrum. In one embodiment, ASR engine 250 may be, for example, an Intel Speech Development Toolkit (ISDT) engine available from Intel Corporation of Santa Clara, Calif. In an alternative embodiment, another ASR engine may be used, for example, ViaVoice available from IBM of Armonk, N.Y. ASR engines are known in the art; accordingly, a detailed discussion is not provided.

[0022] Prompt generation module 260 generates prompts using a text-to-speech (TTS) engine that converts text input into speech output. For example, the input 251 to prompt generation module 260 may be a sentence text and the output 261 is a speech waveform of the sentence text. TTS engines are available from industry manufacturers such as Lucent of Murray Hill, N.J. and Lernout & Hauspie of Belgium. In an alternative embodiment, a custom TTS engine may be used. TTS engines are known in the art; accordingly, a detailed discussion is not provided.

[0023] After the prompt waveform is generated, prompt generation module 260 plays the prompt through voice output device 220 to the user of telephony system 290. It should be noted that in an alternative embodiment, the operation of voice processing module 225 may be implemented in hardware, for example, in a digital signal processor.

[0024] Referring again to speech detection module 230, in one embodiment, two criteria may be used to determine if input signal 204 contains speech. One criterion may be based on frame energy. A frame is a segment of input signal 204. Frame energy is the signal energy within the segment. In one embodiment, if a segment of the detected input signal 204 contains speech, then it may be assumed that a certain number of frames of a running window of frames will have their energy levels above a predetermined minimum energy threshold. The window of frames may be either sequential or non-sequential. The energy threshold may be set to account for energy from non-desired speech, such as energy from prompt echo.

[0025] In one embodiment, for example, a frame may be set to be 20 milliseconds (ms), where speech is assumed to be short-time stationary up to 20 ms; the number of frames may be set to be 8 frames; and the running window may be set to be 10 frames. If, in this running window, the energy of 8 frames is over the predetermined minimum energy threshold, then the current time may be considered as the start point of the speech. The energy threshold may be based on, for example, the average energy of the echo of the prompt currently being played. In this manner, the frame energy threshold may be set dynamically: for each prompt, the threshold may be set to the average energy of that prompt's echo. The average energy of a prompt echo may be pre-computed and stored when the prompt is added into system 200.
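The running-window criterion of this example may be sketched as follows. This is a minimal illustration rather than the implementation of speech detection module 230: the function names `frame_energies` and `find_speech_start` are hypothetical, frame energy is taken to be the sum of squared samples, and the frame length in samples depends on a sampling rate that the description does not state (160 samples would correspond to 20 ms at an assumed 8 kHz).

```python
from collections import deque


def frame_energies(samples, frame_len):
    """Split samples into non-overlapping frames and return each frame's energy.

    Energy is computed as the sum of squared samples (an assumption; the
    description only says "signal energy within the segment").
    """
    return [
        sum(x * x for x in samples[i:i + frame_len])
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def find_speech_start(energies, threshold, min_frames=8, window=10):
    """Return the index of the frame at which speech is considered to start.

    The start point is the first frame at which at least `min_frames` of the
    most recent `window` frames have energy above `threshold`. Returns None
    if the criterion is never met.
    """
    recent = deque(maxlen=window)  # True/False flags for the last `window` frames
    for i, energy in enumerate(energies):
        recent.append(energy > threshold)
        if sum(recent) >= min_frames:
            return i
    return None
```

In this sketch the threshold would be supplied per prompt, for instance the stored average echo energy of the prompt currently playing.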

[0026] Another criterion that may be used to determine if input signal 204 contains speech is the duration of input signal 204. If the duration of input signal 204 is greater than a predetermined value, then it may be assumed that input signal 204 contains speech. For example, in one embodiment, it is assumed that any speech event lasts at least 300 ms. As such, the duration value may be set to be 300 ms.

[0027] After a possible start point of speech is detected, speech detection module 230 attempts to detect the end point of the speech using the same method as detecting the start point. The start point and the end point of speech are used to calculate the duration. Continuing the example, if the speech duration is over 300 ms, then the possible start point of speech is a real speech start point and the current speech frames and successive speech frames may be sent to feature extraction module 240. Otherwise, the possible start point of speech is not a real start point of speech and speech detection is reset. This procedure lasts until an end point of speech is detected or input signal 204 is over a maximum possible length.
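The start point, end point, and duration checks described above may be combined as in the following sketch. It is offered under several assumptions: the names `find_transition` and `detect_speech` are hypothetical, the end point is found with the same 8-of-10 running-window rule applied to frames below the threshold, and reaching the end of the supplied frames stands in for the "maximum possible length" condition.

```python
def find_transition(flags, start, want, min_frames=8, window=10):
    """Return the first index i >= start at which at least `min_frames` of the
    last `window` flags (not looking before `start`) equal `want`, else None."""
    for i in range(start, len(flags)):
        lo = max(start, i - window + 1)
        if sum(1 for f in flags[lo:i + 1] if f == want) >= min_frames:
            return i
    return None


def detect_speech(energies, threshold, frame_ms=20, min_duration_ms=300):
    """Return (start, end) frame indices of the first accepted speech segment.

    A candidate start point is accepted only if the span up to the detected
    end point lasts longer than `min_duration_ms`; otherwise detection is
    reset and the search continues. Returns None if nothing is accepted.
    """
    flags = [e > threshold for e in energies]
    pos = 0
    while pos < len(flags):
        start = find_transition(flags, pos, True)
        if start is None:
            return None
        end = find_transition(flags, start + 1, False)
        if end is None:
            end = len(flags)  # input ended before an end point was found
        if (end - start) * frame_ms > min_duration_ms:
            return start, end
        pos = end  # too short to be speech: reset and keep looking
    return None
```

Measuring the duration between the two detected points is one reading of the text; an implementation might instead back-date the start point to the first frame above the threshold.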

[0028] Speech detection module 230 may also be used to estimate the time delay of the prompt echo in input signal 204 if an echo cancellation function of system 200 is desired. When a prompt is added to system 200, its waveform may be generated by prompt generation module 260. The waveform of the prompt is played once so that its echo is recorded and stored. When processing an input signal, correlation coefficients between input signal 204 and the stored prompt echo are calculated with the following equation:

$$C(\tau) = \sum_{t=1}^{T} S(t + \tau) \times E(t)$$

[0029] where C is the correlation coefficient; S is input signal 204; E is the prompt echo; T is the echo length; and τ is the candidate time delay of the echo. The value of τ may range from zero to the maximum delay time (e.g., 200 ms). After C is computed for all values of τ, the value of τ at which C is maximal is found. This value of τ is the time-delay estimate of the echo. It is used in feature extraction module 240 when performing spectrum subtraction of the prompt echo spectrum to remove prompt echo from input signal 204 having echo-dirtied speech, as discussed below.
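The delay search defined by the correlation equation may be illustrated with a direct, unoptimized sketch. The function name `estimate_echo_delay` is hypothetical, delays are expressed in samples rather than milliseconds (200 ms would be 1600 samples at an assumed 8 kHz sampling rate), indices are zero-based, and terms that would fall past the end of the input signal are simply skipped, which is an assumption about boundary handling.

```python
def estimate_echo_delay(signal, echo, max_delay):
    """Return the lag tau in [0, max_delay] that maximizes
    C(tau) = sum over t of signal[t + tau] * echo[t]."""
    best_tau, best_c = 0, float("-inf")
    for tau in range(max_delay + 1):
        c = sum(
            signal[t + tau] * echo[t]
            for t in range(len(echo))
            if t + tau < len(signal)  # skip terms beyond the recorded input
        )
        if c > best_c:
            best_tau, best_c = tau, c
    return best_tau
```

The cost of this direct form grows with the product of the echo length and the maximum delay; a transform-based correlation could reduce it, but that is outside what the description states.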

[0030] FIG. 3 illustrates one embodiment of a method of implementing an interactive voice response system. A prompt echo waveform 302 and an input signal 301 are received by the system. In one embodiment, a speech detection module may estimate the time delay of the prompt echo in input signal 301, step 310. The speech detection module may also be used to detect the existence of speech in input signal 301, step 320. The existence of speech may be based on various criteria, such as the amount of frame energy of the input signal and the duration of frame energy, as discussed below in relation to FIG. 4.

[0031] In step 330, feature extraction may be used to obtain a speech feature from the original speech signal waveform. In one embodiment, prompt echo may be removed from input signal 301, using spectrum subtraction, to facilitate the recognition of speech in the input signal. After feature extraction is performed, speech recognition may be performed on input signal 301, step 340. A prompt may then be generated, step 350, based on the recognized speech.

[0032] FIG. 4 illustrates one embodiment of speech detection in an input signal. In one embodiment, the frame energy of an input signal may be used to determine if the input signal contains speech. An assumption may be made that if the energy of the input signal, over a certain period of time, is above a certain threshold level, then the signal may contain speech.

[0033] Thus, in one embodiment, an energy threshold for the input signal may be set, step 410. The energy threshold is set higher than the prompt echo energy so that the system will not consider the energy of prompt echo in the input signal to be speech. In one embodiment, the energy threshold may be based on an average energy of the prompt echo that is the echo of the prompt currently playing during the speech detection. The energy of the input signal is measured over a predetermined time period, step 420, and compared against the energy threshold.

[0034] The input signal may be measured over time segments, or frames. In one embodiment, for example, a frame length of an input signal may be 20 milliseconds in duration where speech is assumed to be a short-time stationary event up to 20 milliseconds. In step 430, the number of frames containing energy above the threshold is counted. If the energy of the input signal over a predetermined number of frames (e.g., 8 frames) is greater than the predetermined energy threshold, then the input signal may be considered to contain speech with that point of time as the start of speech, step 440.

[0035] In one embodiment, the energy of the input signal may be monitored over a running window of time. If in this running window (e.g., 10 frames) there is the predetermined number of frames (e.g., 8 frames) over the predetermined energy threshold, then that point of time may be considered as the start of speech.

[0036] In an alternative embodiment, another method of detecting the start of speech may be used. For example, the rate of input signal energy crossing over the predetermined threshold may be calculated. If the measured rate exceeds a predetermined rate, such as a zero-cross threshold rate, then the existence and start time of speech in the input signal may be determined.

[0037] If no speech is detected in the input signal, then a determination may be made whether the period of silence (i.e., non-speech) is too long, step 445. If a predetermined silence period is not exceeded, then the system continues to monitor the input signal for speech. If the predetermined silence period is exceeded, then the system may end its listening and take other actions, for example, error processing (e.g., close the current call), step 447.

[0038] In one embodiment, the duration of frame energy of an input signal may also be used to determine if the input signal contains speech. A possible start point of speech is detected as described above in relation to steps 410 through 440. After a possible start point of speech is detected, the end point of the speech is detected to determine the duration of speech, step 450. In one embodiment, the end point of speech may be determined in a manner similar to that of detecting the possible start point of speech. For example, the energy of the input signal may be measured over another predetermined time period and compared against the energy threshold. If the energy over the predetermined time period is less than the energy threshold, then the speech in the input signal may be considered to have ended. In one embodiment, the predetermined time in the speech end point determination may be the same as the predetermined time in the speech start point determination. In an alternative embodiment, the predetermined time in the speech end point determination may be different than the predetermined time in the speech start point determination.

[0039] Once the end point of speech is determined, the duration of the speech is calculated, step 460. If the duration is above a predetermined duration threshold, then the possible start point of speech is a real speech start point, step 470. In one embodiment, for example, the predetermined duration threshold may be set to 300 ms where it is assumed that any anticipated speech event lasts for at least 300 ms.

[0040] Otherwise, the possible start point of speech is not a real start point of speech and the speech detection may be reset. This procedure lasts until an end point of speech is detected or the input signal is over a maximum possible length, step 480.

[0041] FIG. 5 illustrates one embodiment of a feature extraction method. In one embodiment, an input signal and a prompt echo waveform are received, steps 515 and 525, respectively. A Fourier transformation is performed to obtain a speech spectrum from the input signal, step 510. A Fourier transformation may also be performed on the echo waveform to generate a prompt echo spectrum, step 526.

[0042] In one embodiment, the prompt echo spectrum is shifted according to a time delay estimated between the input signal and the prompt echo waveform, step 519. The prompt echo spectrum is computed and subtracted from the speech spectrum, step 520. Afterwards, the Cepstrum coefficients may be obtained for use by ASR engine 250 of FIG. 2 in performing speech recognition, step 530.
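The shift-and-subtract steps may be sketched for a single frame as shown below. Several choices here are assumptions that the description does not specify: subtraction is performed on magnitude spectra, negative results are floored at zero, the shift is applied to the echo waveform before transforming (as in the FIG. 6 description), and a naive discrete Fourier transform is used for self-containment where a practical system would use an FFT. The names `magnitude_spectrum` and `subtract_echo` are hypothetical.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Magnitude of the discrete Fourier transform of one frame (naive O(n^2))."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n)
    ]


def subtract_echo(input_frame, echo, delay, frame_start):
    """Subtract the delayed echo's magnitude spectrum from an input frame's.

    `frame_start` is the index of the frame's first sample within the input
    signal, and `delay` is the estimated echo delay in samples.
    """
    n = len(input_frame)
    # Shift the echo by the delay: input sample i lines up with echo sample i - delay.
    aligned = [
        echo[i - delay] if 0 <= i - delay < len(echo) else 0.0
        for i in range(frame_start, frame_start + n)
    ]
    speech_spec = magnitude_spectrum(input_frame)
    echo_spec = magnitude_spectrum(aligned)
    # Flooring at zero is an assumption; the source does not say how
    # negative differences are handled.
    return [max(s - e, 0.0) for s, e in zip(speech_spec, echo_spec)]
```

When the input frame contains only the delayed echo, every bin of the result is zero in this sketch, which is the behavior the subtraction is meant to approximate.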

[0043] In one embodiment, feature extraction involves the cancellation of echo prompt from the input signal, as discussed below in relation to FIG. 6. When ASR engine 250 of FIG. 2 recognizes a segment of speech, feature extraction may be used to subtract a prompt echo spectrum that corresponds to the currently playing prompt from the echo-dirtied speech spectrum. In order to improve spectrum subtraction, an estimation of the time delay between the echo-dirtied speech and the recorded echo may be performed by speech detection module 230 of FIG. 2.

[0044] FIG. 6 illustrates an embodiment of a feature extraction method for a particular feature. In one embodiment, Mel-Frequency Cepstrum Coefficients (MFCC) may be used in performing speech recognition. Using an MFCC generation procedure, a Hamming window is applied to the frame segment set for speech (e.g., 20 ms), step 610. A Fast Fourier Transform (FFT) is calculated to obtain the speech spectrum, step 620. If the echo spectrum subtraction function is enabled, the echo waveform is shifted according to the time delay, the echo spectrum is computed, and the echo spectrum is subtracted from the input signal spectrum, step 630. Next, a logarithmic operation is performed on the speech spectrum, step 640. Mel-scale warping is performed to reflect the non-linear perceptual characteristics of human hearing, step 650. An Inverse Discrete Cosine Transform (IDCT) is performed to obtain the Cepstrum coefficients, step 660. The resulting feature is a multi-dimensional (e.g., 12-dimensional) vector. These parameters form the base feature of MFCC.
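A few building blocks of this procedure may be sketched as follows; this is not a complete MFCC front end. The Hamming coefficients are the standard ones. The Mel mapping `2595 * log10(1 + f / 700)` is a commonly used formula and is an assumption here, since the description gives none. The cepstrum step is written as a DCT-II over log band energies, keeping coefficients 1 through 12, which is a common convention and likewise an assumption. All function names are hypothetical.

```python
import math


def hamming(n):
    """Hamming window coefficients of length n (n >= 2)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def hz_to_mel(f):
    """A commonly used Mel-scale mapping (an assumption; the source gives no formula)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def dct_cepstrum(log_energies, num_coeffs=12):
    """DCT-II of log band energies, keeping coefficients 1..num_coeffs.

    Dropping coefficient 0 and keeping 12 values mirrors the 12-dimension
    base feature mentioned in the text; the exact convention is assumed.
    """
    n = len(log_energies)
    return [
        sum(v * math.cos(math.pi * k * (j + 0.5) / n) for j, v in enumerate(log_energies))
        for k in range(1, num_coeffs + 1)
    ]
```

A windowed frame would be obtained by multiplying each sample by the matching `hamming` coefficient before the transform of step 620.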

[0045] In one embodiment, the first and second derivatives of the base feature are added to be the additional dimensions (the 13^(th) to 24^(th) and 25^(th) to 36^(th) dimensions, respectively), to account for a change of speech over time. By using near-end prompt echo cancellation, the performance of the ASR engine 250 of FIG. 2 may be improved. In one embodiment, for example, the performance of the ASR engine 250 of FIG. 2 may improve by greater than 6%.
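Appending the derivatives to the base feature may be sketched as below. The description does not say how the derivatives are computed; this sketch assumes a central difference across neighboring frames with the edges clamped, and the names `deltas` and `add_derivatives` are hypothetical.

```python
def deltas(frames):
    """Central-difference derivative of each coefficient across frames.

    At the first and last frame the missing neighbor is replaced by the
    frame itself (edge clamping), which is an assumption.
    """
    last = len(frames) - 1
    return [
        [
            (frames[min(t + 1, last)][d] - frames[max(t - 1, 0)][d]) / 2.0
            for d in range(len(frames[t]))
        ]
        for t in range(len(frames))
    ]


def add_derivatives(base):
    """Append first and second derivatives to each base vector.

    With 12-dimension base vectors the result has 36 dimensions: base in
    positions 1-12, first derivative in 13-24, second derivative in 25-36.
    """
    first = deltas(base)
    second = deltas(first)
    return [b + f + s for b, f, s in zip(base, first, second)]
```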

[0046] In the foregoing specification, the invention has been described with reference to specific exemplary embodiments thereof. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention as set forth in the following claims. The specification and drawings are, accordingly, to be regarded in an illustrative sense rather than a restrictive sense.

What is claimed is:
1. A method, comprising: detecting an existence of speech in an input signal; and removing a prompt echo from the input signal using spectrum subtraction.
2. The method of claim 1, wherein removing the prompt echo comprises extracting a feature from input signal to generate a plurality of coefficients and wherein the method further comprises: performing speech recognition on the input signal using the plurality of coefficients; and generating a prompt in response to particular speech recognized in the input signal.
3. The method of claim 1, wherein the existence of speech is detected based on a predetermined energy in a plurality of segments of the input signal.
4. The method of claim 3, wherein the existence of speech is detected based on a predetermined duration of the plurality of segments having the predetermined energy.
5. The method of claim 1, wherein detecting the existence of speech in the input signal comprises: setting an energy threshold for the input signal, the input signal having a plurality of segments; and determining a start point of speech in the input signal, comprising: measuring an energy of the input signal for a first predetermined time; and determining whether the energy of the input signal for the first predetermined time is greater than the energy threshold.
6. The method of claim 5, wherein detecting the existence of speech in the input signal further comprises: measuring a duration of the energy above the energy threshold; and determining whether the duration is greater than a predetermined duration threshold.
7. The method of claim 1, wherein removing the prompt echo from the input signal comprises: estimating a time delay between the input signal and the echo; obtaining a speech spectrum from the input signal, the speech spectrum including the echo; shifting the echo according to the time delay; computing an echo spectrum using the shifted echo; and subtracting the echo spectrum from the speech spectrum.
8. The method of claim 1, wherein removing the prompt echo from the input signal comprises: estimating a time delay between the input signal and the prompt echo; and removing the prompt echo from the input signal based on the time delay.
9. The method of claim 8, wherein removing prompt echo comprises: generating a prompt echo spectrum; calculating a Fast Fourier Transform using the input signal to obtain a speech spectrum; subtracting the prompt echo spectrum from the speech spectrum using the estimated time delay.
10. The method of claim 9, further comprising: performing an inverse DCT to obtain Cepstrum coefficients; performing a logarithm on the speech spectrum; and performing Mel-scale warping on the logarithm of the speech spectrum.
11. A method, comprising: setting an energy threshold for a signal having a plurality of segments; and determining a start point of speech in the signal, comprising: measuring an energy of the signal for a first predetermined time; and determining whether the energy of the signal for the first predetermined time is greater than the energy threshold.
12. The method of claim 11, wherein setting an energy threshold comprises: measuring a prompt echo energy; and setting the energy threshold above the prompt echo energy.
13. The method of claim 11, further comprising determining an end point of the speech.
14. The method of claim 13, wherein determining the end point of the speech comprises: measuring the energy of the signal for a second predetermined time; and determining whether the energy of the signal for the second predetermined time is less than the energy threshold.
15. The method of claim 14, further comprising: calculating a duration based on the start point and the end point; and determining whether the duration is greater than a predetermined duration threshold.
16. The method of claim 14, wherein the first and second predetermined times are the same.
17. The method of claim 13, wherein determining the end point of the speech comprises: measuring the energy of the signal for a second predetermined time; calculating a rate of energy crossing over the energy threshold; and determining whether the rate is greater than a predetermined rate.
18. The method of claim 17, further comprising: calculating a duration based on the start point and the end point; and determining whether the duration is greater than a predetermined duration threshold.
19. The method of claim 11, wherein the determining whether the energy of the signal for the first predetermined time is greater than the energy threshold is performed over a running window of time.
20. The method of claim 16, wherein the first predetermined number of segments is eight and the running window of segments is ten.
21. The method of claim 20, wherein a segment of the plurality of segments is 20 milliseconds.
22. A machine readable medium having stored thereon instructions, which when executed by a processor, cause the processor to perform the following, comprising: detecting the existence of speech in an input signal; and removing a prompt echo from the input signal using spectrum subtraction.
23. The machine readable medium of claim 22, wherein removing the prompt echo from the input signal causes the processor to perform the following, comprising: estimating a time delay between the input signal and the prompt echo; obtaining a speech spectrum from the input signal, the speech spectrum including the prompt echo; shifting the prompt echo according to the time delay; computing a prompt echo spectrum using the shifted echo; and subtracting the prompt echo spectrum from the speech spectrum.
24. The machine readable medium of claim 22, wherein the existence of speech is detected based on a predetermined energy in a plurality of segments of the input signal and a predetermined duration of the plurality of segments having the predetermined energy.
25. A machine readable medium having stored thereon instructions, which when executed by a processor, cause the processor to perform the following, comprising: setting an energy threshold for a signal having a plurality of segments; and determining a start point of speech in the signal, comprising: measuring an energy of the signal for a first predetermined time; and determining whether the energy of the signal for the first predetermined time is greater than the energy threshold.
26. The machine readable medium of claim 25, wherein the processor further performs the following, comprising: determining an end point of speech in the signal, comprising: measuring the energy of the signal for a second predetermined time; and determining whether the energy of the signal for the second predetermined time is less than the energy threshold.
27. The machine readable medium of claim 25, wherein the processor further performs the following, comprising: calculating a duration based on the start point and the end point; and determining whether the duration is greater than a predetermined duration threshold.
28. An apparatus, comprising: a voice processing module; and a voice interface device coupled to the voice processing module, the voice interface device comprising: a voice input device; and a voice output device coupled to the voice input device to support full-duplex data transfer between the voice interface device and a telephony system.
29. The apparatus of claim 28, wherein the voice processing module comprises a digital signal processor.
30. The apparatus of claim 28, wherein the voice processing module comprises processing software and wherein the processing software comprises: a speech detection module; a feature extraction module; an automatic speech recognition engine; and a prompt generation module.